Natural Language and Image-Based Search Support for Recordings #603
WANGXIAOMIN-HIK wants to merge 11 commits into development from
Conversation
To enhance ONVIF's search capabilities, the following operations have been added to support natural language and image-based search for video recordings:

FindImagebyNL
Purpose: Starts a search session using natural language descriptions to locate relevant video recordings. Example query: "Person wearing a red hat."
Parameters:
- StartPoint: Start time for the search.
- EndPoint: End time for the search.
- RecordingToken: (Optional) Token for the recording to search.
- Text: Natural language description for the search.
- Likelihood: (Optional) Similarity threshold for the search (0~1).
- MaxMatches: (Optional) Maximum number of matches to return.
- KeepAliveTime: Time the search session will be kept alive.
Response:
- SearchToken: A unique reference to the search session.

GetNLSearchResults
Purpose: Retrieves results from a natural language search session initiated by FindImagebyNL.
Parameters:
- SearchToken: Token identifying the search session.
- MinResults: (Optional) Minimum number of results to return.
- MaxResults: (Optional) Maximum number of results to return.
- WaitTime: (Optional) Maximum time to wait for results.
Response:
- ResultList: List of matching results, including metadata such as TargetImageURI, Time, Likelihood, and RecordingToken.

FindImagebyImage
Purpose: Starts a search session using a target image to locate relevant video recordings.
Parameters:
- StartPoint: Start time for the search.
- EndPoint: End time for the search.
- RecordingToken: (Optional) Token for the recording to search.
- TargetImageURI: URI of the target image to be searched.
- MaxMatches: (Optional) Maximum number of matches to return.
- KeepAliveTime: Time the search session will be kept alive.
Response:
- SearchToken: A unique reference to the search session.

GetImageSearchResults
Purpose: Retrieves results from an image-based search session initiated by FindImagebyImage.
Parameters:
- SearchToken: Token identifying the search session.
- MinResults: (Optional) Minimum number of results to return.
- MaxResults: (Optional) Maximum number of results to return.
- WaitTime: (Optional) Maximum time to wait for results.
Response:
- ResultList: List of matching results, including metadata such as TargetImageURI, Time, Likelihood, and RecordingToken.

Schema Updates
onvif.xsd: Added complex types FindImageResult and FindImageResultList to support result structures for both natural language and image-based searches, including fields such as TargetImageURI, Time, Likelihood, and RecordingToken.
search.wsdl: Defined operations FindImagebyNL, GetNLSearchResults, FindImagebyImage, and GetImageSearchResults, and added request and response elements for each operation.

Documentation Updates
RecordingSearch.xml: Added detailed descriptions for the FindImagebyNL and GetNLSearchResults operations, explaining their purpose, parameters, and responses.
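For illustration, the session-based flow described above might look like the following sketch. This is not copied from the PR's WSDL; the `tse` namespace prefix and all element values are assumptions for this example.

```xml
<!-- Hypothetical FindImagebyNL request body; the "tse" prefix and all
     values are example assumptions, not taken from the PR's actual WSDL. -->
<tse:FindImagebyNL>
  <tse:StartPoint>2024-05-01T00:00:00Z</tse:StartPoint>
  <tse:EndPoint>2024-05-01T23:59:59Z</tse:EndPoint>
  <tse:RecordingToken>Recording_1</tse:RecordingToken>
  <tse:Text>Person wearing a red hat</tse:Text>
  <tse:Likelihood>0.8</tse:Likelihood>
  <tse:MaxMatches>50</tse:MaxMatches>
  <tse:KeepAliveTime>PT60S</tse:KeepAliveTime>
</tse:FindImagebyNL>

<!-- The response returns only a session token; results are then polled
     with GetNLSearchResults using this token. -->
<tse:FindImagebyNLResponse>
  <tse:SearchToken>SearchToken_1</tse:SearchToken>
</tse:FindImagebyNLResponse>
```

The client would then call GetNLSearchResults with this SearchToken, optionally bounded by MinResults, MaxResults, and WaitTime, until the session expires after KeepAliveTime.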
…search Updated document and WSDL definitions to allow multiple recording tokens to be passed in a search operation to query multiple recordings at the same time.
wsdl/ver10/search.wsdl
Outdated
<xs:element name="EndPoint" type="xs:dateTime">
  <xs:annotation><xs:documentation>End time for the search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="RecordingToken" type="tt:RecordingReference" minOccurs="0" maxOccurs="unbounded">
Add maxOccurs="unbounded", which will allow more than one recording container to be searched.
Yes, thank you for your feedback. In the last meeting, one of the reviewers raised the desire to support searching multiple recording containers.
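With maxOccurs="unbounded", a single request could then carry one element per recording container, e.g. (the `tse` prefix and token values are made up for illustration):

```xml
<!-- Hypothetical request fragment: searching three recordings at once -->
<tse:RecordingToken>Recording_1</tse:RecordingToken>
<tse:RecordingToken>Recording_2</tse:RecordingToken>
<tse:RecordingToken>Recording_3</tse:RecordingToken>
```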
wsdl/ver10/search.wsdl
Outdated
<xs:annotation><xs:documentation>This element contains a list of recording tokens to search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="TargetImageURI" type="xs:anyURI">
  <xs:annotation><xs:documentation>The target image to be searched in LocalStorage URI format.</xs:documentation></xs:annotation>
How does the client get this local storage URI to search with?
Yes, thank you for your feedback. The TargetImageURI is the result returned from SearchImageByNL.
Can't we use a SearchImagebyImage request without getting a result from SearchImageByNL? I feel SearchImagebyImage and SearchImageByNL are independent search sessions, i.e. one is an image-based search and the other is a text-based search.
@WANGXIAOMIN-HIK Yes, I agree with @venki5685. I feel there should not be a dependence on either API!
Thank you for your feedback. @venki5685 @kieran242
After thinking about it carefully, we agree: SearchImagebyImage and SearchImageByNL are independent search sessions.
TargetImageURI can be a local URI or a remote URI.
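The two cases could then be sketched as follows (both URIs are hypothetical examples; the `tse` prefix is assumed):

```xml
<!-- Case 1: local URI, e.g. one previously returned by SearchImageByNL
     in LocalStorage format (hypothetical value) -->
<tse:TargetImageURI>http://192.168.1.10/LocalStorage/images/target_0001.jpg</tse:TargetImageURI>

<!-- Case 2: remote URI pointing at an image the client provides
     (hypothetical value) -->
<tse:TargetImageURI>http://client.example.com/uploads/red-hat-person.jpg</tse:TargetImageURI>
```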
wsdl/ver10/search.wsdl
Outdated
</xs:element>

<!-- Define FindImagebyImage -->
<xs:element name="FindImagebyImageRequest">
The FindImagebyImage name can be revisited.
Yes, thank you for your feedback. We changed FindImageByImage to SearchImageByImage, and FindImageByNL to SearchImageByNL.
Is this functionality aimed at a Camera device or Network Video Recorder or Both?
Thank you for your question. @kieran242 Yes, both: this functionality is aimed at camera devices and Network Video Recorders. The device implements an algorithm that leverages massive annotated image-text pairs during training, where visual features (e.g., "dog", "snow", "yellow", "fur") and textual elements (e.g., tokenized "snow/field/dog") are extracted through cross-modal neural networks, forming the foundation of its text-to-image retrieval model. In deployment, the system processes video streams to detect targets, then employs on-device models to generate and store binary-encoded feature vectors for rapid matching.

@WANGXIAOMIN-HIK Many thanks for your response. It was very informative.
doc/RecordingSearch.xml
Outdated
</section>

<section>
  <title>SerachImagebyImage</title>
SerachImagebyImage -> SearchImagebyImage
Thank you for your clear feedback. I have now fixed the issue. I appreciate your help.
kieran242
left a comment
@WANGXIAOMIN-HIK A few minor updates as suggestions: the WSDL update is good, but there is a spelling mistake in the doc.
doc/RecordingSearch.xml
Outdated
<section>
  <title>SerachImagebyImage</title>
  <para>SerachImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
- <para>SerachImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
+ <para>SearchImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
doc/RecordingSearch.xml
Outdated
</section>
<section>
  <title>GetImageSearchResults</title>
  <para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SerachImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
- <para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SerachImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
+ <para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SearchImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
Thank you very much for your advice. I have fixed the spelling error. I appreciate your help.
doc/RecordingSearch.xml
Outdated
<para role="text">The point of time where the search will stop.</para>
<para role="param">RecordingToken - optional [tt:RecordingReference]</para>
<para role="text">Token for the recording to search.</para>
<para role="param">TargetImageURI [xs:anyURI]</para>
Add a search parameter to accept an external image from the client, in addition to the NPL target image URI. The client can then use either NPLTargetImageURI or an external image for the image search feature.
We updated the search functionality and added the TargetImageData parameter.
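A request carrying the new TargetImageData parameter might then look like the sketch below. The element name follows the comment above, but the base64 encoding, surrounding structure, and all values are assumptions, not the PR's final schema:

```xml
<!-- Hypothetical SearchImageByImage request using inline image data
     instead of TargetImageURI; structure, types, and values are assumed. -->
<tse:SearchImageByImageRequest>
  <tse:StartPoint>2024-05-01T00:00:00Z</tse:StartPoint>
  <tse:EndPoint>2024-05-01T23:59:59Z</tse:EndPoint>
  <!-- base64-encoded image bytes supplied directly by the client -->
  <tse:TargetImageData>/9j/4AAQSkZJRgABAQ...</tse:TargetImageData>
  <tse:KeepAliveTime>PT60S</tse:KeepAliveTime>
</tse:SearchImageByImageRequest>
```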
…nd provide detailed explanation of the use of TargetImageURI
doc/RecordingSearch.xml
Outdated
<para role="param">TargetImageURI [xs:anyURI]</para>
<para role="text">The TargetImageURI is the result returned from SearchImageByNL.</para>
<para role="param">TargetImageURI - optional [xs:anyURI]</para>
<para role="text">The URI of the detected target object image. This can be either: - a local image stored in the NPL Target Image repository (LocalStorage format), or - an external image provided by the client for image search or feature matching.</para>
@WANGXIAOMIN-HIK Please add an entry for "NPL" to this document's "Definitions" in section 3.1 to explain that it is "Natural Language Processing". It will add clarity to the document.
… the terminology, as the original meaning refers to the images stored internally on the device.
…ult to FindObjectImageResult.
The description: It represents the cosine similarity between two vectors, which is used to measure the similarity of the directions of the vectors. The closer the value is to 1, the higher the similarity; the closer the value is to 0, the lower the similarity.
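In symbols, for two feature vectors $a, b \in \mathbb{R}^n$ the measure described above is:

```latex
\mathrm{similarity}(a, b) = \cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
```

In general this lies in [-1, 1]; it stays in [0, 1] as described above when the feature vectors are non-negative.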
doc/RecordingSearch.xml
Outdated
<para role="text">Token for the recording to search.</para>
<para role="param">Text [xs:string]</para>
<para role="text">Natural language description for the search.</para>
<para role="param">CosineSimilarity - optional [xs:float]</para>
CosineSimilarity is just one out of a set of common similarity measures.
I can imagine that ONVIF just defines an abstract similarity between zero and one, or a complex item supporting multiple similarity measures.
For the sake of simplicity I prefer the first approach, as the second one would require a set of capabilities indicating which ones a device supports.
Yes,
cosine similarity is just one of the most commonly used similarity measures in vector space. In actual image retrieval/similarity assessment, multiple distance measures, matching based on local features, perceptual similarity, as well as learned metrics or hash/quantization indexing are also used.
Could we consider changing the field back to 'Similarity', while noting in the comments that we are currently using the cosine vector method? @dstafx
OK for me, since this is a big and important topic that we likely need to come back to on a more general level. But to explain a bit more:
My general concern (and the industry challenge) is that if we are too generic it will not be useful across vendors, as the implementation scores would not be comparable. If it is a generic number, the message to the client is that every endpoint where this interface is offered may have a different implementation, so the similarity is not comparable between endpoints. When searching, some devices or vendors may then consistently report higher similarities, thereby potentially hiding relevant results. By stating that the similarity is only for sorting search results from a single endpoint, we can avoid this.
Thank you very much for your detailed explanation. After careful consideration, I still think it should be defined as similarity, as we cannot restrict the vendors' implementation algorithms.
However, regarding the issue you mentioned about cross-device and cross-vendor search, we can add a note stating that this similarity is only guaranteed to be effective for ranking within the same search result returned by the same endpoint, and it does not guarantee that similarity can be compared across different vendors, devices, or endpoints.
As for the cross-device search issue, we can discuss it in our next meeting. This would require imposing constraints on the hardware vendors' implementation mechanisms, such as ensuring that devices support the same algorithm scheduling to guarantee consistency in device detection mechanisms.
@WANGXIAOMIN-HIK @dstafx @HansBusch Is this issue resolved regarding the "similarity measures"? I see it is required for IPR review.
kieran242
left a comment
@WANGXIAOMIN-HIK approved with discussion on VE WG Call.
It appears that both APIs, GetImageSearchResults and GetNLSearchResults, currently return identical results. We could either consolidate them into a single unified method for search results,
GetNLSearchResults and GetImageSearchResults are distinguished by their usage scenarios.
@WANGXIAOMIN-HIK, if you believe it makes sense to keep them separate, I'd recommend using different result formats. This would allow each interface to evolve independently. (For example, if we later add parameters specific to image-based search, they wouldn't automatically apply to natural language search results.)
… of the interface. We have added FindNLSearchResultList and FindNLSearchResult to distinguish the return values of the GetNLSearchResults and GetImageSearchResults interfaces.
GetNLSearchResults -> FindNLSearchResultList, FindNLSearchResult
GetImageSearchResults -> FindObjectImageResultList, FindObjectImageResult
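Based only on the field names mentioned in this PR, the image-search result types might be sketched roughly as below; the exact element types, ordering, and optionality are assumptions, not the merged schema:

```xml
<!-- Rough sketch of the result types; types and ordering are assumed -->
<xs:complexType name="FindObjectImageResult">
  <xs:sequence>
    <xs:element name="RecordingToken" type="tt:RecordingReference"/>
    <xs:element name="Time" type="xs:dateTime"/>
    <xs:element name="TargetImageURI" type="xs:anyURI"/>
    <xs:element name="Similarity" type="xs:float" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="FindObjectImageResultList">
  <xs:sequence>
    <xs:element name="Result" type="tt:FindObjectImageResult" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
```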
Thank you for your suggestion. @sujithhanwha